This article presents an analytic survey of end-to-end speech recognition systems and of approaches to their construction and optimization. We consider models based on connectionist temporal classification (CTC), models based on an encoder-decoder architecture with an attention mechanism, and models using conditional random fields (CRF). We also describe how language models can be integrated at the decoding stage, and find that such integration significantly reduces recognition error rates for end-to-end models. A survey of research in this area shows that end-to-end systems achieve results close to those of state-of-the-art hybrid models. At the same time, end-to-end models have a simple configuration and are fast to train and decode. In addition, we review popular frameworks and toolkits for building speech recognition systems.